Abstract


The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is
investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve
universal consistency. We show that provided AdaBoost is stopped after $n^{1-\varepsilon}$ iterations,
for sample size $n$ and $\varepsilon \in (0,1)$, the sequence of risks of the classifiers it produces
approaches the Bayes risk.
1. Introduction


Boosting algorithms are an important recent development in classification. These algorithms belong 
to a group of voting methods (see, for example, Schapire, 1990; Freund, 1995; Freund and Schapire, 
1996, 1997; Breiman, 1996, 1998), that produce a classifier as a linear combination of base or weak 
classifiers. While empirical studies show that boosting is one of the best off-the-shelf
classification algorithms (see Breiman, 1998), theoretical results do not give a complete
explanation of its effectiveness.

The first formulations of boosting by Schapire (1990), Freund (1995), and Freund and Schapire 
(1996, 1997) considered boosting as an iterative algorithm that is run for a fixed number of iterations 
and at every iteration it chooses one of the base classifiers, assigns a weight to it and eventually 
outputs the classifier that is the weighted majority vote of the chosen classifiers. Later Breiman 
(1997, 1998, 2004) pointed out that boosting is a gradient descent type algorithm (see also Friedman 
et al., 2000; Mason et al., 2000). 

Experimental results by Drucker and Cortes (1996), Quinlan (1996), Breiman (1998), Bauer 
and Kohavi (1999) and Dietterich (2000) showed that boosting is a very effective method, that often 
leads to a low test error. It was also noted that boosting continues to decrease test error long after the 
sample error becomes zero: though it keeps adding more weak classifiers to the linear combination 
of classifiers, the generalization error, perhaps surprisingly, usually does not increase. However 
some of the experiments suggested that there might be problems, since boosting performed worse 
than bagging in the presence of noise (Dietterich, 2000), and boosting concentrated not only on 
the “hard” areas, but also on outliers and noise (Bauer and Kohavi, 1999). Indeed, further
experiments, for example by Friedman et al. (2000), Grove and Schuurmans (1998) and Mason et al.
(2000) (see also Bickel et al., 2006), as well as some theoretical results (for example, Jiang, 2002),
showed that boosting, run for an arbitrarily large number of steps, overfits, though it takes a very
long time to do so.

Upper bounds on the risk of boosted classifiers were obtained, based on the fact that boosting 
tends to maximize the margin of the training examples (Schapire et al., 1998; Koltchinskii and 
Panchenko, 2002), but Breiman (1999) pointed out that margin-based bounds do not completely 
explain the success of boosting methods. In particular, these results do not resolve the issue of 
consistency: they do not explain under which conditions we may expect the risk to converge to the 
Bayes risk. A recent work by Reyzin and Schapire (2006) shows that while maximization of the 
margin is useful, it should not be done at the expense of the classifier complexity. 

Breiman (2004) showed that under some assumptions on the underlying distribution “popula- 
tion boosting” converges to the Bayes risk as the number of iterations goes to infinity. Since the 
population version assumes infinite sample size, this does not imply a similar result for AdaBoost, 
especially given the results of Jiang (2002), who shows that there are examples in which AdaBoost has
asymptotically suboptimal prediction error at $t = \infty$ ($t$ is the number of iterations).

Several authors have shown that modified versions of AdaBoost are consistent. These modifications
include restricting the $\ell_1$-norm of the combined classifier (Mannor et al., 2003; Blanchard
et al., 2003; Lugosi and Vayatis, 2004; Zhang, 2004), and restricting the step size of the
algorithm (Zhang and Yu, 2005). Jiang (2004) analyses the unmodified boosting algorithm and proves
a process consistency property, under certain assumptions. Process consistency means that there
exists a sequence $(t_n)$ such that if AdaBoost with sample size $n$ is stopped after $t_n$ iterations,
its risk approaches the Bayes risk. However, Jiang also imposes strong conditions on the underlying
distribution: the distribution of $X$ (the predictor) has to be absolutely continuous with respect to
Lebesgue measure and the function $F_B(X) = (1/2)\ln(P(Y = 1|X)/P(Y = -1|X))$ has to be continuous on $\mathcal{X}$.
Also Jiang’s proof is not constructive and does not give any hint on when the algorithm should be 
stopped. Bickel et al. (2006) prove a consistency result for AdaBoost, under the assumption that the 
probability distribution is such that the steps taken by the algorithm are not too large. In this paper, 
we study stopping rules that guarantee consistency. In particular, we are interested in AdaBoost, not 
a modified version. Our main result (Corollary 9) demonstrates that a simple stopping rule suffices 
for consistency: the number of iterations is a fixed function of the sample size. We assume only that 
the class of base classifiers has finite VC-dimension, and that the span of this class is sufficiently 
rich. Both assumptions are clearly necessary. 


2. Notation 


Here we describe the AdaBoost procedure formulated as a coordinate descent algorithm and introduce
definitions and notation. We consider a binary classification problem. We are given $\mathcal{X}$, the
measurable (feature) space, and $\mathcal{Y} = \{-1, 1\}$, the set of (binary) labels. We are given a sample
$S_n = \{(X_i, Y_i)\}_{i=1}^n$ of i.i.d. observations distributed as the random variable $(X, Y) \sim P$,
where $P$ is an unknown distribution. Our goal is to construct a classifier $g_n : \mathcal{X} \to \mathcal{Y}$
based on this sample. The quality of the classifier $g_n$ is given by the misclassification probability

$$L(g_n) = P(g_n(X) \neq Y \mid S_n).$$

Of course we want this probability to be as small as possible and close to the Bayes risk

$$L^* = \inf_g L(g) = E(\min\{\eta(X), 1 - \eta(X)\}),$$

where the infimum is taken over all possible (measurable) classifiers and $\eta(\cdot)$ is a conditional
probability

$$\eta(x) = P(Y = 1 \mid X = x).$$

The infimum above is achieved by the Bayes classifier $g^*(x) = g(2\eta(x) - 1)$, where

$$g(x) = \begin{cases} 1, & x > 0, \\ -1, & x \le 0. \end{cases}$$

We are going to produce a classifier as a linear combination of base classifiers in
$\mathcal{H} = \{h \mid h : \mathcal{X} \to \mathcal{Y}\}$. We shall assume that the class $\mathcal{H}$ has a finite
VC (Vapnik-Chervonenkis) dimension

$$d_{VC}(\mathcal{H}) = \max\{|S| : S \subseteq \mathcal{X},\ |\mathcal{H}_{|S}| = 2^{|S|}\}.$$

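AdaBoost works to find a combination $f$ that minimizes the convex criterion

$$\frac{1}{n} \sum_{i=1}^n \exp(-Y_i f(X_i)).$$

Many of our results are applicable to a broader family of such algorithms, where the function
$\alpha \mapsto \exp(-\alpha)$ is replaced by another function $\varphi$. Thus, for a function
$\varphi : \mathbb{R} \to \mathbb{R}^+$, we define the empirical $\varphi$-risk and the $\varphi$-risk,

$$R_{\varphi,n}(f) = \frac{1}{n} \sum_{i=1}^n \varphi(Y_i f(X_i)) \quad \text{and} \quad R_\varphi(f) = E\varphi(Y f(X)).$$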


Clearly, the function $\varphi$ needs to be appropriate for classification, in the sense that a
measurable $f$ that minimizes $R_\varphi(f)$ should have minimal risk. This is equivalent (see Bartlett
et al., 2006) to $\varphi$ satisfying the following condition (‘classification calibration’). For all
$0 \le \eta \le 1$, $\eta \neq 1/2$,

$$\inf\{\eta\varphi(\alpha) + (1-\eta)\varphi(-\alpha) : \alpha(2\eta - 1) \le 0\} > \inf\{\eta\varphi(\alpha) + (1-\eta)\varphi(-\alpha) : \alpha \in \mathbb{R}\}. \quad (1)$$

We shall assume that $\varphi$ satisfies (1).
Then the boosting procedure can be described as follows.


1. Set $f_0 = 0$. Choose number of iterations $t$.


2. For $k = 1, \ldots, t$, set

$$f_k = f_{k-1} + \alpha_{k-1} h_{k-1},$$

where the following holds for some fixed $\gamma \in (0, 1]$ independent of $k$:

$$R_{\varphi,n}(f_k) \le \gamma \inf_{h \in \mathcal{H},\, \alpha \in \mathbb{R}} R_{\varphi,n}(f_{k-1} + \alpha h) + (1 - \gamma) R_{\varphi,n}(f_{k-1}). \quad (2)$$

We call $\alpha_i$ the step size of the algorithm at step $i$.

3. Output $g \circ f_t$ as the final classifier.


The choice of $\gamma < 1$ in the above algorithm allows approximate minimization. Notice that the
original formulation of AdaBoost assumed exact minimization in (2), which corresponds to $\gamma = 1$.
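
The three steps above can be sketched in code for the exponential loss with exact minimization ($\gamma = 1$). This is only an illustrative sketch under stated assumptions: the base class of one-dimensional threshold classifiers (`make_stumps`, `stump`), the toy sample, and all function names are hypothetical and not taken from the paper. For a base class that contains $-h$ whenever it contains $h$, picking the classifier with the smallest weighted error and taking the closed-form step is the exact minimizer over $(h, \alpha)$.

```python
import math

def make_stumps(xs):
    # Hypothetical base class H: threshold classifiers on the real line,
    # h(x) = s if x > thr else -s, with both signs s so that H is symmetric.
    pts = sorted(set(xs))
    thresholds = [pts[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(pts, pts[1:])]
    return [(thr, s) for thr in thresholds for s in (1, -1)]

def stump(h, x):
    thr, s = h
    return s if x > thr else -s

def exp_risk(f_vals, ys):
    # Empirical exponential risk R_n(f) = (1/n) * sum_i exp(-Y_i f(X_i)).
    return sum(math.exp(-y * f) for y, f in zip(ys, f_vals)) / len(ys)

def adaboost(xs, ys, t):
    """Run at most t steps of exact coordinate descent (gamma = 1) on R_n."""
    hs = make_stumps(xs)
    f_vals = [0.0] * len(xs)               # step 1: f_0 = 0
    risks = [exp_risk(f_vals, ys)]
    for _ in range(t):                     # step 2: k = 1, ..., t
        w = [math.exp(-y * f) for y, f in zip(ys, f_vals)]
        total = sum(w)

        def weighted_error(h):
            return sum(wi for wi, x, y in zip(w, xs, ys) if stump(h, x) != y) / total

        h = min(hs, key=weighted_error)    # best direction for the exponential loss
        e = weighted_error(h)
        if e <= 0.0 or e >= 0.5:           # infimum not attained, or no descent direction
            break
        alpha = 0.5 * math.log((1.0 - e) / e)   # exact line search along h
        f_vals = [f + alpha * stump(h, x) for f, x in zip(f_vals, xs)]
        risks.append(exp_risk(f_vals, ys))
    labels = [1 if f > 0 else -1 for f in f_vals]   # step 3: g o f_t
    return labels, risks

# Toy sample that no single threshold classifier separates.
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
ys = [1, 1, -1, 1, -1, -1]
labels, risks = adaboost(xs, ys, t=10)
```

The list `risks` records the empirical exponential risk after each step; with exact minimization it cannot increase from one step to the next.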
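We shall also use the convex hull of $\mathcal{H}$ scaled by $\lambda \ge 0$,

$$\mathcal{F}_\lambda = \left\{ f \;\middle|\; f = \sum_{i=1}^n \lambda_i h_i,\ n \in \mathbb{N} \cup \{0\},\ \lambda_i \ge 0,\ \sum_{i=1}^n \lambda_i = \lambda,\ h_i \in \mathcal{H} \right\},$$

as well as the set of $k$-combinations, $k \in \mathbb{N}$, of functions in $\mathcal{H}$

$$\mathcal{F}^k = \left\{ f \;\middle|\; f = \sum_{i=1}^k \lambda_i h_i,\ \lambda_i \in \mathbb{R},\ h_i \in \mathcal{H} \right\}.$$

We also need to define the $\ell_\star$-norm: for any $f \in \mathcal{F}$

$$\|f\|_\star = \inf\left\{ \sum |\alpha_i| : f = \sum \alpha_i h_i,\ h_i \in \mathcal{H} \right\}.$$

Define the squashing function $\pi_l(\cdot)$ to be

$$\pi_l(x) = \begin{cases} l, & x > l, \\ x, & x \in [-l, l], \\ -l, & x < -l. \end{cases}$$

Then the set of truncated functions is

$$\pi_l \circ \mathcal{F} = \{ \tilde{f} \mid \tilde{f} = \pi_l(f),\ f \in \mathcal{F} \}.$$

The set of classifiers based on a class $\mathcal{F}$ is denoted by

$$g \circ \mathcal{F} = \{ \tilde{f} \mid \tilde{f} = g(f),\ f \in \mathcal{F} \}.$$

Define the derivative of an arbitrary function $Q(\cdot)$ in the direction of $h$ as

$$Q'(f; h) = \left. \frac{\partial Q(f + \lambda h)}{\partial \lambda} \right|_{\lambda = 0}.$$

The second derivative $Q''(f; h)$ is defined similarly.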


3. Consistency of Boosting Procedure 


In this section, we present the proof of the consistency of AdaBoost. We begin with an overview. 

The usual approach to proving consistency involves a few key steps (see, for example, Bartlett
et al., 2004). The first is a comparison theorem, which shows that as the $\varphi$-risk $R_\varphi(f_n)$
approaches $R_\varphi^*$ (the infimum over measurable functions of $R_\varphi$), $L(f_n)$ approaches $L^*$.
The classification calibration condition (1) suffices for this (Bartlett et al., 2006). The second
step is to show that the class of functions is suitably rich so that there is some sequence of
elements $\bar{f}_n$ for which $\lim_{n \to \infty} R_\varphi(\bar{f}_n) = R_\varphi^*$. The third step is to
show that the $\varphi$-risk of the estimate $f_n$ approaches that of the reference sequence $\bar{f}_n$.
For instance, for a method of sieves that minimizes the empirical $\varphi$-risk over a suitable set
$\mathcal{F}_n$ (which increases with the sample size $n$), one could define the reference sequence
$\bar{f}_n$ as the minimizer of the $\varphi$-risk in $\mathcal{F}_n$. Then, provided that the sets
$\mathcal{F}_n$ grow suitably slowly with $n$, the maximal deviation over $\mathcal{F}_n$ between empirical
$\varphi$-risk and $\varphi$-risk would converge to zero. Such a uniform convergence result would imply
that the sequence $f_n$ has $\varphi$-risk converging to $R_\varphi^*$.
The key difficulty with this approach is that the concentration inequalities behind the uniform
convergence results are valid only for a suitably small class of suitably bounded functions. However
boosting in general and AdaBoost in particular may produce functions that cannot be appropriately
bounded. To circumvent this difficulty, we rely on the observation that, for the purposes of
classification, we can replace the function $f$ returned by AdaBoost by any function $f'$ that satisfies
$\mathrm{sign}(f') = \mathrm{sign}(f)$. Therefore we consider the clipped version $\pi_\lambda \circ f_t$ of
the function returned by AdaBoost after $t$ iterations. This clipping ensures that the functions $f_t$
are suitably bounded. Furthermore, the complexity of the clipped class (as measured by its
pseudo-dimension; see Pollard, 1984) grows slowly with the stopping time $t$, so we can show that the
$\varphi$-risk of a clipped function is not much larger than its empirical $\varphi$-risk. Lemma 4 provides
the necessary details. In order to compare the empirical $\varphi$-risk of the clipped function to that
of a suitable reference sequence $\bar{f}_n$, we first use the fact that the empirical $\varphi$-risk of a
clipped function $\pi_\lambda \circ f_t$ is not much larger than the empirical $\varphi$-risk of $f_t$.
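
The two facts used in this paragraph can be checked on a small example: clipping at level $\lambda > 0$ leaves the predicted labels unchanged, and for the exponential loss it raises the empirical risk by at most $\inf_{x \in [-\lambda, \lambda]} e^{-x} = e^{-\lambda}$. The sketch below assumes made-up function values and labels; the names `clip`, `g` and `exp_risk` are illustrative only.

```python
import math

def clip(v, lam):
    # Squashing function: clip v to the interval [-lam, lam].
    return max(-lam, min(lam, v))

def g(v):
    # Label assigned to a real-valued score: 1 if v > 0, else -1
    # (ties at 0 are sent to -1 here).
    return 1 if v > 0 else -1

def exp_risk(f_vals, ys):
    # Empirical exponential risk (1/n) * sum_i exp(-Y_i f(X_i)).
    return sum(math.exp(-y * f) for y, f in zip(ys, f_vals)) / len(ys)

# Hypothetical values f(X_i) and labels Y_i, chosen only for illustration.
f_vals = [3.0, -0.2, -5.0, 0.7, 10.0]
ys = [1, -1, 1, 1, -1]
lam = 1.0

clipped = [clip(f, lam) for f in f_vals]
same_labels = [g(f) for f in f_vals] == [g(c) for c in clipped]
phi_lam = math.exp(-lam)   # infimum of exp(-x) over [-lam, lam]
gap = exp_risk(clipped, ys) - exp_risk(f_vals, ys)
```

Here `gap` is the change in empirical risk caused by clipping; it is bounded above by `phi_lam`, and it is negative whenever clipping tames a badly misclassified point.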

The next step is to relate $R_{\varphi,n}(f_t)$ to $R_{\varphi,n}(\bar{f}_n)$. The choice of a suitable sieve
depends on what can be shown about the progress of the algorithm. We consider an increasing sequence
of $\ell_\star$-balls, and define $\bar{f}_n$ as the (near) minimizer of the $\varphi$-risk in the
appropriate $\ell_\star$-ball. Theorems 6 and 8 show that as the stopping time increases, the empirical
$\varphi$-risk of the function returned by AdaBoost is not much larger than that of $\bar{f}_n$. Finally
Hoeffding’s inequality shows that the empirical $\varphi$-risks of the reference functions $\bar{f}_n$ are
close to their $\varphi$-risks. Combining all the pieces, the $\varphi$-risk of $\pi_\lambda \circ f_t$
approaches $R_\varphi^*$, provided the stopping time increases suitably slowly with the sample size. The
consistency of AdaBoost follows.

We now describe our assumptions. First, we shall impose the following condition. 


Condition 1 Denseness. Let the distribution $P$ and class $\mathcal{H}$ be such that

$$\lim_{\lambda \to \infty} \inf_{f \in \mathcal{F}_\lambda} R_\varphi(f) = R_\varphi^*,$$

where $R_\varphi^* = \inf R_\varphi(f)$ over all measurable functions.


For many classes H, the above condition is satisfied for all possible distributions P. Lugosi and 
Vayatis (2004, Lemma 1) discuss sufficient conditions for Condition 1. As an example of such a 
class, we can take the class of indicators of all rectangles or the class of indicators of half-spaces 
defined by hyperplanes or the class of binary trees with the number of terminal nodes equal to 
d+ 1 (we consider trees with terminal nodes formed by successive univariate splits), where d is the 
dimensionality of X (see Breiman, 2004). 

The following set of conditions deals with uniform convergence and convergence of the boosting 
algorithm. The main theorem (Theorem 1) shows that these, together with Condition 1, suffice for 
consistency of the boosting procedure. Later in this section we show that the conditions are satisfied 
by AdaBoost. 


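Condition 2 Let $n$ be sample size. Let there exist non-negative sequences $t_n \to \infty$,
$\zeta_n \to \infty$ and a sequence $\{\bar{f}_n\}_{n=1}^\infty$ of reference functions such that

$$R_\varphi(\bar{f}_n) \to R_\varphi^* \quad \text{as } n \to \infty,$$

and suppose that the following conditions are satisfied.

a. Uniform convergence of $t_n$-combinations.

$$\sup_{f \in \pi_{\zeta_n} \circ \mathcal{F}^{t_n}} |R_\varphi(f) - R_{\varphi,n}(f)| \to 0 \quad \text{a.s. as } n \to \infty. \quad (3)$$

b. Convergence of empirical $\varphi$-risks for the sequence $\{\bar{f}_n\}_{n=1}^\infty$.

$$\max\{0, R_{\varphi,n}(\bar{f}_n) - R_\varphi(\bar{f}_n)\} \to 0 \quad \text{a.s. as } n \to \infty. \quad (4)$$

c. Algorithmic convergence of $t_n$-combinations.

$$\max\{0, R_{\varphi,n}(f_{t_n}) - R_{\varphi,n}(\bar{f}_n)\} \to 0 \quad \text{a.s. as } n \to \infty. \quad (5)$$


Now we state the main theorem.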


Theorem 1 Assume $\varphi$ is classification calibrated and convex. Assume, without loss of generality,
that for $\varphi_\lambda = \inf_{x \in [-\lambda, \lambda]} \varphi(x)$,

$$\lim_{\lambda \to \infty} \varphi_\lambda = \inf_{x \in (-\infty, \infty)} \varphi(x) = 0. \quad (6)$$

Let Condition 2 be satisfied. Then the boosting procedure stopped at step $t_n$ returns a sequence of
classifiers $f_{t_n}$ almost surely satisfying $L(g(f_{t_n})) \to L^*$ as $n \to \infty$.


Remark 2 Note that Condition (6) could be replaced by the mild condition that the function $\varphi$ is
bounded below.


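Proof For almost every outcome $\omega$ on the probability space $(\Omega, \mathcal{S}, P)$ we can define
sequences $\varepsilon_n^1(\omega) \to 0$, $\varepsilon_n^2(\omega) \to 0$ and $\varepsilon_n^3(\omega) \to 0$,
such that for almost all $\omega$ the following inequalities are true.

$$\begin{aligned}
R_\varphi(\pi_{\zeta_n}(f_{t_n})) &\le R_{\varphi,n}(\pi_{\zeta_n}(f_{t_n})) + \varepsilon_n^1(\omega) && \text{by (3)} \\
&\le R_{\varphi,n}(f_{t_n}) + \varepsilon_n^1(\omega) + \varphi_{\zeta_n} && (7) \\
&\le R_{\varphi,n}(\bar{f}_n) + \varepsilon_n^1(\omega) + \varphi_{\zeta_n} + \varepsilon_n^3(\omega) && \text{by (5)} \\
&\le R_\varphi(\bar{f}_n) + \varepsilon_n^1(\omega) + \varphi_{\zeta_n} + \varepsilon_n^3(\omega) + \varepsilon_n^2(\omega) && \text{by (4).} \quad (8)
\end{aligned}$$

Inequality (7) follows from the convexity of $\varphi(\cdot)$ (see Lemma 14 in Appendix E). By (6) and
choice of the sequence $\{\bar{f}_n\}_{n=1}^\infty$ we have $R_\varphi(\bar{f}_n) \to R_\varphi^*$ and
$\varphi_{\zeta_n} \to 0$. And from (8) follows $R_\varphi(\pi_{\zeta_n}(f_{t_n})) \to R_\varphi^*$ a.s.
Eventually we can use the result by Bartlett et al. (2006, Theorem 3) to conclude that

$$L(g(\pi_{\zeta_n}(f_{t_n}))) \to L^* \quad \text{a.s.}$$

But for $\zeta_n > 0$ we have $g(\pi_{\zeta_n}(f_{t_n})) = g(f_{t_n})$, therefore

$$L(g(f_{t_n})) \to L^* \quad \text{a.s.}$$

Hence, the boosting procedure is consistent if stopped after $t_n$ steps. □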


The almost sure formulation of Condition 2 does not provide explicit rates of convergence of
$L(g(f_{t_n}))$ to $L^*$. However, a slightly stricter form of Condition 2, which allows these rates to
be calculated, is considered in Appendix A.

In the following sections, we show that Condition 2 can be satisfied for some choices of $\varphi$. We
shall treat parts (a)-(c) separately.
3.1 Uniform Convergence of $t_n$-Combinations

Here we show that Condition 2 (a) is satisfied for a variety of functions $\varphi$, and in particular
for exponential loss used in AdaBoost. We begin with a simple lemma (see Freund and Schapire, 1997,
Theorem 8 or Anthony and Bartlett, 1999, Theorem 6.1):

Lemma 3 For any $t \in \mathbb{N}$ if $d_{VC}(\mathcal{H}) \ge 2$ the following holds:

$$d_P(\mathcal{F}^t) \le 2(t + 1)(d_{VC}(\mathcal{H}) + 1) \log_2[2(t + 1)/\ln 2],$$

where $d_P(\mathcal{F}^t)$ is the pseudo-dimension of class $\mathcal{F}^t$.


The proof of consistency is based on the following result, which builds on the result by Koltchin- 
skii and Panchenko (2002) and resembles a lemma due to Lugosi and Vayatis (2004, Lemma 2). 


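Lemma 4 For a continuous function $\varphi$ define the Lipschitz constant

$$L_{\varphi,\zeta} = \inf\{L \mid L > 0,\ |\varphi(x) - \varphi(y)| \le L|x - y|,\ -\zeta \le x, y \le \zeta\}$$

and maximum absolute value of $\varphi(\cdot)$ when argument is in $[-\zeta, \zeta]$

$$M_{\varphi,\zeta} = \max_{x \in [-\zeta, \zeta]} |\varphi(x)|.$$

Then for $V = d_{VC}(\mathcal{H})$, $c = 24 \int_0^1 \sqrt{\ln(8e/\epsilon^2)}\, d\epsilon$ and any $n$,
$\zeta > 0$ and $t > 0$,

$$E \sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}(f)| \le c \zeta L_{\varphi,\zeta} \sqrt{\frac{(V + 1)(t + 1) \log_2[2(t + 1)/\ln 2]}{n}}.$$

Also, for any $\delta > 0$, with probability at least $1 - \delta$,

$$\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}(f)| \le c \zeta L_{\varphi,\zeta} \sqrt{\frac{(V + 1)(t + 1) \log_2[2(t + 1)/\ln 2]}{n}} + M_{\varphi,\zeta} \sqrt{\frac{\ln(1/\delta)}{2n}}. \quad (9)$$

Proof The proof is given in Appendix B. □


Now, if we choose $\zeta$ and $\delta$ as functions of $n$, such that $\sum_{n=1}^\infty \delta(n) < \infty$
and the right hand side of (9) converges to 0 as $n \to \infty$, we can appeal to the Borel-Cantelli
lemma and conclude that for such a choice of $\zeta_n$ and $\delta_n$ Condition 2 (a) holds.

Lemma 4, unlike Lemma 2 of Lugosi and Vayatis (2004), allows us to choose the number of
steps $t$, which describes the complexity of the linear combination of base functions, and this is
essential for the proof of the consistency. It is easy to see that for AdaBoost (i.e., $\varphi(x) = e^{-x}$)
we can choose $\zeta = \kappa \ln n$ and $t = n^{1-\varepsilon}$ with $\kappa > 0$, $\varepsilon \in (0, 1)$ and
$2\kappa - \varepsilon < 0$.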
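
To see why the constraint $2\kappa - \varepsilon < 0$ matters, the expectation bound of Lemma 4 can be evaluated numerically for the exponential loss. The sketch below is an illustration under assumptions: it drops the constant $c$, uses the fact that the Lipschitz constant of $e^{-x}$ on $[-\zeta, \zeta]$ is $e^{\zeta} = n^{\kappa}$, and the function name `lemma4_rate` and the chosen parameter values are hypothetical.

```python
import math

def lemma4_rate(n, kappa, eps, vc_dim):
    # Expectation bound of Lemma 4 without the constant c, for phi(x) = exp(-x),
    # zeta = kappa * ln(n) and t = n**(1 - eps). On [-zeta, zeta] the Lipschitz
    # constant of exp(-x) is exp(zeta) = n**kappa.
    zeta = kappa * math.log(n)
    t = n ** (1.0 - eps)
    lipschitz = math.exp(zeta)
    complexity = (vc_dim + 1) * (t + 1) * math.log2(2.0 * (t + 1) / math.log(2.0))
    return zeta * lipschitz * math.sqrt(complexity / n)

# 2*kappa - eps < 0: the bound shrinks as n grows.
good = [lemma4_rate(10.0 ** p, kappa=0.1, eps=0.5, vc_dim=3) for p in (6, 9, 12)]
# 2*kappa - eps > 0: the bound grows with n.
bad = [lemma4_rate(10.0 ** p, kappa=0.4, eps=0.5, vc_dim=3) for p in (6, 9, 12)]
```

Up to logarithmic factors the bound behaves like $n^{\kappa - \varepsilon/2}$, which is why the sign of $2\kappa - \varepsilon$ decides whether it vanishes.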
3.2 Convergence of Empirical $\varphi$-Risks for the Sequence $\{\bar{f}_n\}_{n=1}^\infty$


To show that Condition 2(b) is satisfied for a variety of loss functions we use Hoeffding’s inequality.


Theorem 5 Define the variation of a function $\varphi$ on the interval $[-a, a]$ (for $a > 0$) as

$$V_{\varphi,a} = \sup_{x \in [-a, a]} \varphi(x) - \inf_{x \in [-a, a]} \varphi(x).$$

If a sequence $\{\bar{f}_n\}_{n=1}^\infty$ satisfies the condition $\bar{f}_n(x) \in [-\lambda_n, \lambda_n]$,
$\forall x \in \mathcal{X}$, where $\lambda_n > 0$ is chosen so that $V_{\varphi,\lambda_n} = o(\sqrt{n/\ln n})$, then

$$\max\{0, R_{\varphi,n}(\bar{f}_n) - R_\varphi(\bar{f}_n)\} \to 0 \quad \text{a.s. as } n \to \infty. \quad (10)$$

Proof Since we restricted the range of $\bar{f}_n$ to the interval $[-\lambda_n, \lambda_n]$, we have, almost
surely, $\varphi(Y \bar{f}_n(X)) \in [a, b]$, where $b - a \le V_{\varphi,\lambda_n}$. Therefore Hoeffding’s
inequality guarantees that for all $\varepsilon_n$,

$$P\left(R_{\varphi,n}(\bar{f}_n) - R_\varphi(\bar{f}_n) \ge \varepsilon_n\right) \le \exp\left(-2n\varepsilon_n^2 / V_{\varphi,\lambda_n}^2\right) = \delta_n.$$

To prove the statement of the theorem we require $\varepsilon_n = o(1)$ and $\sum \delta_n < \infty$. Then we
appeal to the Borel-Cantelli lemma to conclude that (10) holds. These restrictions are satisfied if

$$V_{\varphi,\lambda_n}^2 = o\left(\frac{n}{\ln n}\right)$$

and the statement of the theorem follows. □


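The choice of $\lambda_n$ in the above theorem depends on the loss function $\varphi$. In the case of the
AdaBoost loss $\varphi(x) = e^{-x}$ we shall choose $\lambda_n = \kappa \ln n$, where $\kappa \in (0, 1/2)$.
One way to guarantee that the functions $\bar{f}_n$ satisfy condition
$\bar{f}_n(x) \in [-\lambda_n, \lambda_n]$, $\forall x \in \mathcal{X}$, is to choose $\bar{f}_n \in \mathcal{F}_{\lambda_n}$.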
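
As a worked instance of the growth condition on the variation, one can fix a summable sequence and solve Hoeffding's bound for the deviation. The choice $\delta_n = n^{-2}$ below is an assumption made only for concreteness; any summable sequence would serve.

```latex
% Fix \delta_n = n^{-2} in Hoeffding's bound and solve for \varepsilon_n:
\exp\!\left(-\frac{2 n \varepsilon_n^2}{V_{\varphi,\lambda_n}^2}\right) = n^{-2}
\quad\Longleftrightarrow\quad
\varepsilon_n = V_{\varphi,\lambda_n} \sqrt{\frac{\ln n}{n}} .
% So \varepsilon_n \to 0 exactly when V_{\varphi,\lambda_n} = o(\sqrt{n/\ln n}).
% For \varphi(x) = e^{-x} and \lambda_n = \kappa \ln n:
V_{\varphi,\lambda_n} = e^{\lambda_n} - e^{-\lambda_n} \le n^{\kappa},
\qquad
\varepsilon_n \le n^{\kappa - 1/2} \sqrt{\ln n} \;\longrightarrow\; 0
\quad \text{for } \kappa < 1/2 .
```

This is why the exponential loss tolerates only logarithmic growth of $\lambda_n$, with $\kappa$ strictly below $1/2$.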


3.3 Algorithmic Convergence of AdaBoost 


So far we dealt with the statistical properties of the function we are minimizing; now we turn to 
the algorithmic part. Here we show that Condition 2(c) is satisfied for the AdaBoost algorithm. We 
need the following simple consequence of the proof of Bickel et al. (2006, Theorem 1). 


Theorem 6 Let the function $Q(f)$ be convex in $f$ and twice differentiable in all directions
$h \in \mathcal{H}$. Let $Q^* = \lim_{\lambda \to \infty} \inf_{f \in \mathcal{F}_\lambda} Q(f)$. Assume that
$\forall c_1, c_2$, such that $Q^* < c_1 < c_2 < \infty$,

$$0 < \inf\{Q''(f; h) : c_1 < Q(f) < c_2,\ h \in \mathcal{H}\} \le \sup\{Q''(f; h) : Q(f) < c_2,\ h \in \mathcal{H}\} < \infty.$$

Also assume the following approximate minimization scheme for $\gamma \in (0, 1]$. Define
$f_{k+1} = f_k + \alpha_{k+1} h_{k+1}$ such that

$$Q(f_{k+1}) \le \gamma \inf_{h \in \mathcal{H},\, \alpha \in \mathbb{R}} Q(f_k + \alpha h) + (1 - \gamma) Q(f_k)$$

and

$$Q(f_{k+1}) = \inf_{\alpha \in \mathbb{R}} Q(f_k + \alpha h_{k+1}).$$

Then for any reference function $\bar{f}$ and the sequence of functions $f_m$, produced by the boosting
algorithm, the following bound holds $\forall m > 0$ such that $Q(f_m) > Q(\bar{f})$.

$$Q(f_m) \le Q(\bar{f}) + \sqrt{\frac{8B^3 (Q(f_0) - Q(\bar{f}))^2}{\gamma^2 \beta^3}} \left( \ln \frac{\ell_0^2 + c_3 m}{\ell_0^2} \right)^{-1/2}, \quad (11)$$

where $\ell_k = \|\bar{f} - f_k\|_\star$, $c_3 = 2(Q(f_0) - Q(\bar{f}))/\beta$,
$\beta = \inf\{Q''(f; h) : Q(\bar{f}) < Q(f) < Q(f_0),\ h \in \mathcal{H}\}$,
$B = \sup\{Q''(f; h) : Q(f) < Q(f_0),\ h \in \mathcal{H}\}$.


Proof The statement of the theorem is a version of a result implicit in the proof of (Bickel et al.,
2006, Theorem 1). The proof is given in Appendix C. □

Remark 7 Results in Zhang and Yu (2005, e.g., Lemma 4.1) provide similar bounds under either an
assumption of a bounded step size of the boosting algorithm or a positive lower bound on $Q''(f; h)$
for all $f, h$. Since we consider boosting algorithms with unrestricted step size, the only option would
be to assume a positive lower bound on the second derivative. While such an assumption is fine for
the quadratic loss $\varphi(x) = x^2$, the second derivative $R_n''(f; h)$ of the empirical risk for the
exponential loss used by the AdaBoost algorithm cannot be bounded from below by a positive constant
in the general case. Theorem 6 makes the mild assumption that the second derivative is positive for
all $f$ such that $R(f) > R^*$ ($R_n(f) > R_n^*$).


It is easy to see that the theorem above applies to the AdaBoost algorithm, since there we first
choose the direction (base classifier) $h_i$ and then we compute the step size $\alpha_i$ as

$$\alpha_i = \frac{1}{2} \ln \frac{1 - \varepsilon_i}{\varepsilon_i} = \frac{1}{2} \ln \frac{R_n(f_i) - R_n'(f_i; h_i)}{R_n(f_i) + R_n'(f_i; h_i)}.$$

Now we only have to recall that this value of $\alpha_i$ corresponds to exact minimization in the
direction $h_i$.
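
The equality of the two expressions for the step size, and the claim of exact minimization, can be checked directly. In the short derivation below, $\varepsilon$ denotes the weighted error of $h$ under weights proportional to $e^{-Y_j f(X_j)}$; this notation is an assumption of the sketch rather than something defined in the text.

```latex
% Let \varepsilon be the weighted error of h under weights w_j \propto e^{-Y_j f(X_j)}.
% Since h(X_j) \in \{-1, 1\}:
R_n(f + \alpha h) = R_n(f)\left[(1-\varepsilon)\, e^{-\alpha} + \varepsilon\, e^{\alpha}\right],
\qquad
R_n'(f; h) = -R_n(f)\,(1 - 2\varepsilon).
% Setting the derivative in \alpha to zero:
-(1-\varepsilon)\, e^{-\alpha} + \varepsilon\, e^{\alpha} = 0
\quad\Longrightarrow\quad
\alpha = \frac{1}{2} \ln \frac{1-\varepsilon}{\varepsilon}
       = \frac{1}{2} \ln \frac{R_n(f) - R_n'(f; h)}{R_n(f) + R_n'(f; h)} .
```

The last equality uses $R_n(f) - R_n'(f; h) = 2R_n(f)(1 - \varepsilon)$ and $R_n(f) + R_n'(f; h) = 2R_n(f)\varepsilon$.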

From now on we are going to specialize to AdaBoost and use $\varphi(x) = e^{-x}$. Hence we drop the
subscript $\varphi$ in $R_{\varphi,n}$ and $R_\varphi$ and use $R_n$ and $R$ respectively.

Theorem 6 allows us to get an upper bound on the difference between the exp-risk of the function 
output by AdaBoost and the exp-risk of the appropriate reference function. For brevity in the next 
theorem we make an assumption R* > 0, though a similar result can be stated for R* = 0. For 
completeness, the corresponding theorem is given in Appendix D. 





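Theorem 8 Assume $R^* > 0$. Let $t_n$ be the number of steps we run AdaBoost. Let $\lambda_n = \kappa \ln n$,
$\kappa \in (0, 1/2)$. Let $a > 1$ be an arbitrary fixed number. Let $\{\bar{f}_n\}_{n=1}^\infty$ be a sequence
of functions such that $\bar{f}_n \in \mathcal{F}_{\lambda_n}$. Then with probability at least $1 - \delta_n$,
where $\delta_n = \exp(-2(R^*)^2 n^{1-2\kappa}/a^2)$, the following holds

$$R_n(f_{t_n}) \le R_n(\bar{f}_n) + \frac{2\sqrt{2}\,(1 - R^*(a-1)/a)}{\gamma \left(R^*(a-1)/a\right)^{3/2}} \left( \ln \frac{\lambda_n^2 + 2 t_n \left(a/(a-1) - R^*\right)/R^*}{\lambda_n^2} \right)^{-1/2}.$$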

Proof This theorem follows directly from Theorem 6. Because in AdaBoost

$$R_n''(f; h) = \frac{1}{n} \sum_{i=1}^n (-Y_i h(X_i))^2 e^{-Y_i f(X_i)} = \frac{1}{n} \sum_{i=1}^n e^{-Y_i f(X_i)} = R_n(f),$$

then all the conditions in Theorem 6 are satisfied as long as $R_n(\bar{f}_n) > 0$ (with $Q(f)$ replaced by
$R_n(f)$) and in Equation (11) we have $B = R_n(f_0) = 1$, $\beta \ge R_n(\bar{f}_n)$,
$\|f_0 - \bar{f}_n\|_\star \le \lambda_n$. Since for $t$ such that $R_n(f_t) \le R_n(\bar{f}_n)$ the theorem is
trivially true, we only have to notice that $\exp(-Y_i \bar{f}_n(X_i)) \in (0, n^\kappa]$, hence Hoeffding’s
inequality guarantees that

$$P\left( \frac{1}{n} \sum_{i=1}^n e^{-Y_i \bar{f}_n(X_i)} - E e^{-Y \bar{f}_n(X)} \le -\frac{R^*}{a} \right) \le \exp\left(-2(R^*)^2 n^{1-2\kappa}/a^2\right) = \delta_n,$$

where we choose and fix the constant $a > 1$ arbitrarily and independently of $n$ and the sequence
$\{\bar{f}_n\}_{n=1}^\infty$. Therefore with probability at least $1 - \delta_n$ we bound the empirical risk
from below as $R_n(\bar{f}_n) \ge R(\bar{f}_n) - R^*/a \ge R^* - R^*/a = R^*(a-1)/a$, since
$R(\bar{f}_n) \ge R^*$. Therefore $\beta \ge R^*(a-1)/a$ and the result follows immediately from Equation (11)
if we use the fact that $R^* > 0$. □


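It is easy to see that the choice of $\lambda_n$’s in the above theorem ensures that
$\sum_{n=1}^\infty \delta_n < \infty$, therefore the Borel-Cantelli lemma guarantees that for
$t_n \to \infty$ sufficiently fast (for example as $O(n^\alpha)$ for $\alpha \in (0, 1)$)

$$\max\{0, R_n(f_{t_n}) - R_n(\bar{f}_n)\} \to 0 \quad \text{a.s. as } n \to \infty.$$

If in addition to the conditions of Theorem 8 we require that

$$R(\bar{f}_n) \le \inf_{f \in \mathcal{F}_{\lambda_n}} R(f) + \varepsilon_n,$$

for some $\varepsilon_n \to 0$, then together with Condition 1 this will imply $R(\bar{f}_n) \to R^*$ as
$n \to \infty$ and Condition 2 (c) follows.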


3.4 Consistency of AdaBoost 


Having all the ingredients at hand, consistency of AdaBoost is a simple corollary of Theorem 1.

Corollary 9 Assume $V = d_{VC}(\mathcal{H}) < \infty$,

$$\lim_{\lambda \to \infty} \inf_{f \in \mathcal{F}_\lambda} R(f) = R^*$$

and $t_n = n^{1-\varepsilon}$ for $\varepsilon \in (0, 1)$. Then AdaBoost stopped at step $t_n$ returns a sequence
of classifiers almost surely satisfying $L(g(f_{t_n})) \to L^*$.


Proof First assume $L^* > 0$. For the exponential loss function this implies $R^* > 0$. As was suggested
after the proof of Lemma 4 we may choose $\zeta_n = \kappa \ln n$ for $2\kappa - \varepsilon < 0$ (which also
implies $\kappa < 1/2$) to satisfy Condition 2 (a). Recall that the discussion after the proof of
Theorem 8 suggests the choice of the sequence $\{\bar{f}_n\}_{n=1}^\infty$ of reference functions such that
$\bar{f}_n \in \mathcal{F}_{\lambda_n}$ and

$$R(\bar{f}_n) \le \inf_{f \in \mathcal{F}_{\lambda_n}} R(f) + \varepsilon_n$$

for $\varepsilon_n \to 0$ and $\lambda_n = \kappa \ln n$ with $\kappa \in (0, 1/2)$ to ensure that Condition 2 (c)
holds. Eventually, as follows from the discussion after the proof of Theorem 5, the choice of the
sequence $\{\bar{f}_n\}_{n=1}^\infty$ to satisfy Condition 2(c) also ensures that Condition 2(b) holds. Since
the function $\varphi(x) = e^{-x}$ is clearly classification calibrated and the conditions of this
Corollary assume Condition 1, all the conditions of Theorem 1 hold and consistency of the AdaBoost
algorithm follows.

For $L^* = 0$ the proof is similar, but we need to use Theorem 13 in Appendix D instead of
Theorem 8. □
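
The parameter choices used in this proof can be collected into a small helper. This is a sketch under assumptions: the function name `corollary9_schedule` is hypothetical, rounding the stopping time up to an integer is an implementation choice, and $\kappa = \varepsilon/4$ is just one arbitrary value satisfying $2\kappa - \varepsilon < 0$.

```python
import math

def corollary9_schedule(n, eps):
    # Stopping time t_n = n**(1 - eps), rounded up to an integer (the rounding is
    # an implementation choice, not part of the statement).
    t_n = math.ceil(n ** (1.0 - eps))
    # Any kappa with 2*kappa - eps < 0 works; kappa = eps / 4 is one arbitrary pick.
    kappa = eps / 4.0
    zeta_n = kappa * math.log(n)    # clipping level zeta_n = kappa * ln(n)
    lambda_n = kappa * math.log(n)  # radius lambda_n of the reference class
    return t_n, kappa, zeta_n, lambda_n

t_n, kappa, zeta_n, lambda_n = corollary9_schedule(n=10**6, eps=0.5)
```

The clipping level and the reference radius appear only in the analysis; the algorithm itself needs nothing beyond the stopping time.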


4. Discussion 


We showed that AdaBoost is consistent if stopped sufficiently early, after $t_n$ iterations, for
$t_n = O(n^{1-\varepsilon})$ with $\varepsilon \in (0, 1)$. We do not know whether this number can be increased.
Results by Jiang (2002) imply that for some $\mathcal{X}$ and function class $\mathcal{H}$ the AdaBoost
algorithm will achieve zero training error after $t_n$ steps, where $n^2/t_n = o(1)$ (see also work by
Mannor and Meir (2001, Lemma 1) for an example of $\mathcal{X} = \mathbb{R}^d$ and
$\mathcal{H} = \{\text{linear classifiers}\}$, for which perfect separation on the training sample is
guaranteed after $8n^2 \ln n$ iterations), hence if run for that many iterations, the AdaBoost
algorithm does not produce a consistent classifier. We do not know what happens in between
$O(n^{1-\varepsilon})$ and $O(n^2 \ln n)$. Lessening this gap is a subject of further research.

The AdaBoost algorithm, as well as other versions of the boosting procedure, replaces the 0-1
loss with a convex function $\varphi$ to overcome algorithmic difficulties associated with the non-convex
optimization problem. In order to conclude that $R_\varphi(f_n) \to R_\varphi^*$ implies
$L(g(f_n)) \to L^*$ we want $\varphi$ to be classification calibrated and this requirement cannot be
relaxed, as shown by Bartlett et al. (2006).

The statistical part of the analysis, summarized in Lemma 4 and Theorem 5, works for quite
an arbitrary loss function $\varphi$. The only restriction imposed by Lemma 4 is that $\varphi$ must be
Lipschitz on any compact set. This requirement is an artifact of our proof and is caused by the use of
the “contraction principle”. It can be relaxed in some cases: Shen et al. (2003) use the classification
calibrated loss function

$$\psi(x) = \begin{cases} 2, & x < 0, \\ 1 - x, & 0 \le x < 1, \\ 0, & x \ge 1, \end{cases}$$

which is non-Lipschitz on any interval $[-\lambda, \lambda]$, $\lambda > 0$.

The algorithmic part, presented by Theorems 6 and 8, concentrated on the analysis of the
exponential (AdaBoost) loss $\varphi(x) = e^{-x}$. This approach also works for the quadratic loss
$\varphi(x) = (1 - x)^2$. Theorem 6 assumes that the second derivative $R_\varphi''(f; h)$ is bounded from
below by a positive constant, possibly dependent on the value of $R_\varphi(f)$, as long as
$R_\varphi(f) > R_\varphi^*$. This condition is clearly satisfied for $\varphi(x) = (1 - x)^2$:
$R_\varphi''(f; h) = 2$ and we do not need an analog of Theorem 8; Theorem 6 suffices. Lemma 4 can be
applied for the quadratic loss with $L_{\varphi,\lambda} = 2(1 + \lambda)$ and
$M_{\varphi,\lambda} = (1 + \lambda)^2$. We may choose $t_n$, $\lambda_n$, $\zeta_n$ the same as for the
exponential loss or set $\lambda_n = n^{1/4 - \delta_1}$, $\delta_1 \in (0, 1/4)$,
$\zeta_n = n^{\rho - \delta_2}$, $\delta_2 \in (0, \rho)$, $\rho = \min(\varepsilon/2, 1/4)$ to get the following
analog of Corollary 9.


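Corollary 10 Assume $\varphi(x) = (1 - x)^2$. Assume $V = d_{VC}(\mathcal{H}) < \infty$,

$$\lim_{\lambda \to \infty} \inf_{f \in \mathcal{F}_\lambda} R(f) = R^*$$

and $t_n = n^{1-\varepsilon}$ for $\varepsilon \in (0, 1)$. Then the boosting procedure stopped at step $t_n$
returns a sequence of classifiers almost surely satisfying $L(g(f_{t_n})) \to L^*$.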
We cannot draw an analogous conclusion about other loss functions. For example, for the logit loss
$\varphi(x) = \ln(1 + e^{-x})$, Lemma 4 and Theorem 5 work, since $L_{\varphi,\lambda} = 1$ and
$M_{\varphi,\lambda} = \ln(1 + e^{\lambda})$, hence choosing $t_n$, $\lambda_n$, $\zeta_n$ as for either the
exponential or quadratic losses will work. The assumption of Theorem 6 also holds with
$R_{\varphi,n}''(f; h) \ge R_{\varphi,n}(f)/n$, though the resulting inequality is trivial: the
factor $1/n$ in this bound precludes us from finding an analog of Theorem 8. A similar problem
arises in the case of the modified quadratic loss $\varphi(x) = [\max(1 - x, 0)]^2$, for which
$R_{\varphi,n}''(f; h) \ge 2/n$. Generally, any loss function with “really flat” regions may cause
trouble. Another issue is the very slow rate of convergence in Theorems 6 and 8. Hence further
research intended either to improve convergence rates or to extend the applicability of these theorems
to loss functions other than exponential and quadratic is desirable.
Appendix A. Rate of Convergence of $L(g(f_{t_n}))$ to $L^*$


Here we formulate Condition 2 in a stricter form and prove consistency along with a rate of conver- 
gence of the boosting procedure to the Bayes risk. 





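Condition 3 Let $n$ be sample size. Let there exist non-negative sequences $t_n \to \infty$,
$\zeta_n \to \infty$, $\delta_n^j \to 0$ such that $\sum_{i=1}^\infty \delta_i^j < \infty$, $j = 1, 2, 3$,
$\varepsilon_n^k \to 0$, $k = 1, 2, 3$, and a sequence $\{\bar{f}_n\}_{n=1}^\infty$ of reference functions such
that

$$R_\varphi(\bar{f}_n) \to R_\varphi^* \quad \text{as } n \to \infty,$$

and the following conditions hold.

a. Uniform convergence of $t_n$-combinations.

$$P\left( \sup_{f \in \pi_{\zeta_n} \circ \mathcal{F}^{t_n}} |R_\varphi(f) - R_{\varphi,n}(f)| > \varepsilon_n^1 \right) < \delta_n^1. \quad (12)$$

b. Convergence of empirical $\varphi$-risks for the sequence $\{\bar{f}_n\}_{n=1}^\infty$.

$$P\left( R_{\varphi,n}(\bar{f}_n) - R_\varphi(\bar{f}_n) > \varepsilon_n^2 \right) < \delta_n^2. \quad (13)$$

c. Algorithmic convergence of $t_n$-combinations.

$$P\left( R_{\varphi,n}(f_{t_n}) - R_{\varphi,n}(\bar{f}_n) > \varepsilon_n^3 \right) < \delta_n^3. \quad (14)$$

Now we state the analog of Theorem 1.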
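Theorem 11 Assume $\varphi$ is classification calibrated and convex, and for
$\varphi_\lambda = \inf_{x \in [-\lambda, \lambda]} \varphi(x)$ without loss of generality assume

$$\lim_{\lambda \to \infty} \varphi_\lambda = \inf_{x \in (-\infty, \infty)} \varphi(x) = 0. \quad (15)$$

Let Condition 3 be satisfied. Then the boosting procedure stopped at step $t_n$ returns a sequence of
classifiers $f_{t_n}$ almost surely satisfying $L(g(f_{t_n})) \to L^*$ as $n \to \infty$.


Proof Consider the following sequence of inequalities.

$$\begin{aligned}
R_\varphi(\pi_{\zeta_n}(f_{t_n})) &\le R_{\varphi,n}(\pi_{\zeta_n}(f_{t_n})) + \varepsilon_n^1 && \text{by (12)} \quad (16) \\
&\le R_{\varphi,n}(f_{t_n}) + \varepsilon_n^1 + \varphi_{\zeta_n} \\
&\le R_{\varphi,n}(\bar{f}_n) + \varepsilon_n^1 + \varphi_{\zeta_n} + \varepsilon_n^3 && \text{by (14)} \quad (17) \\
&\le R_\varphi(\bar{f}_n) + \varepsilon_n^1 + \varphi_{\zeta_n} + \varepsilon_n^3 + \varepsilon_n^2 && \text{by (13).} \quad (18)
\end{aligned}$$

Inequalities (16), (18) and (17) hold with probability at least $1 - \delta_n^1$, $1 - \delta_n^2$ and
$1 - \delta_n^3$ respectively. We assumed in Condition 3 that $R_\varphi(\bar{f}_n) \to R_\varphi^*$ and (15)
implies that $\varphi_{\zeta_n} \to 0$ by the choice of the sequence $\zeta_n$. Now we appeal to the
Borel-Cantelli lemma and arrive at $R_\varphi(\pi_{\zeta_n}(f_{t_n})) \to R_\varphi^*$ a.s.
Eventually we can use Theorem 3 by Bartlett et al. (2006) to conclude that

$$L(g(\pi_{\zeta_n}(f_{t_n}))) \to L^* \quad \text{a.s.}$$

But for $\zeta_n > 0$ we have $g(\pi_{\zeta_n}(f_{t_n})) = g(f_{t_n})$, therefore

$$L(g(f_{t_n})) \to L^* \quad \text{a.s.}$$

Hence the boosting procedure is consistent if stopped after $t_n$ steps. □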


We could prove Theorem 11 by using the Borel-Cantelli lemma and appealing to Theorem 1, 
but the above proof allows the following corollary on the rate of convergence. 


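Corollary 12 Let the conditions of Theorem 11 be satisfied. Then there exists a non-decreasing
function $\psi$, such that $\psi(0) = 0$, and with probability at least
$1 - \delta_n^1 - \delta_n^2 - \delta_n^3$

$$L(g(f_{t_n})) - L^* \le \psi^{-1}\left( \varepsilon_n^1 + \varepsilon_n^2 + \varepsilon_n^3 + \varphi_{\zeta_n} + \inf_{f \in \mathcal{F}_{\lambda_n}} R_\varphi(f) - R_\varphi^* \right), \quad (19)$$

where $\psi^{-1}$ is the inverse of $\psi$.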


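Proof From Theorem 3 of Bartlett et al. (2006), if $\varphi$ is convex we have that

$$\psi(\theta) = \varphi(0) - \inf\left\{ \frac{1 + \theta}{2} \varphi(\alpha) + \frac{1 - \theta}{2} \varphi(-\alpha) : \alpha \in \mathbb{R} \right\},$$

and for any distribution and any measurable function $f$

$$L(g(f)) - L^* \le \psi^{-1}\left( R_\varphi(f) - R_\varphi^* \right).$$

On the other hand,

$$R_\varphi(f) - R_\varphi^* = \left( R_\varphi(f) - \inf_{f \in \mathcal{F}_{\lambda_n}} R_\varphi \right) + \left( \inf_{f \in \mathcal{F}_{\lambda_n}} R_\varphi - R_\varphi^* \right).$$

The proof of Theorem 11 shows that for the function $f_{t_n}$, with probability at least
$1 - \delta_n^1 - \delta_n^2 - \delta_n^3$,

$$R_\varphi(f_{t_n}) - \inf_{f \in \mathcal{F}_{\lambda_n}} R_\varphi \le \varepsilon_n^1 + \varepsilon_n^2 + \varepsilon_n^3 + \varphi_{\zeta_n}.$$

Putting all the components together we obtain (19). □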


The second term under $\psi^{-1}$ in (19) is an approximation error and, in a general case, it may
decrease arbitrarily slowly. However, if it is known that it decreases sufficiently fast, the first term
becomes an issue. For example Corollary 9, even if the approximation error decreases sufficiently
fast, will give a convergence rate of the order $O\big((\ln n)^{-1/4}\big)$. This follows from Example 1 by Bartlett
et al. (2006), where it is shown that for AdaBoost (exponential loss function) $\psi^{-1}(x) \le \sqrt{2x}$, and the
fact that both $\varepsilon_n^1$ and $\varepsilon_n^2$, as well as $\varphi_{\zeta_n}$ in Corollary 9 decrease at the rate $O(n^{-\alpha})$ (in fact, the $\alpha$'s might
be different for all three of them), hence everything is dominated by $\varepsilon_n^3$, which is $O\big((\ln n)^{-1/2}\big)$.
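As a numerical sanity check of the inequality $\psi^{-1}(x) \le \sqrt{2x}$ quoted above for the exponential loss, the following Python sketch evaluates the transform $\psi$ by a direct one-dimensional minimization and compares it with the closed form $\psi(\theta) = 1 - \sqrt{1 - \theta^2}$. The function names, the search interval and the number of iterations are assumptions of this illustration and are not part of the original analysis.

```python
import math


def phi(x):
    """Exponential loss phi(x) = exp(-x), the AdaBoost case."""
    return math.exp(-x)


def psi(theta, lo=-20.0, hi=20.0, iters=200):
    """psi(theta) = phi(0) - inf_alpha [(1+theta)/2 phi(alpha) + (1-theta)/2 phi(-alpha)].

    The bracket is convex in alpha, so a ternary search on [lo, hi] locates its
    infimum for |theta| < 1 (the interval is an assumption of this sketch).
    """
    def h(alpha):
        return 0.5 * (1 + theta) * phi(alpha) + 0.5 * (1 - theta) * phi(-alpha)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if h(m1) < h(m2):
            hi = m2
        else:
            lo = m1
    return phi(0) - h(0.5 * (lo + hi))


def psi_inverse(x):
    """Inverse of the closed form psi(theta) = 1 - sqrt(1 - theta^2) on [0, 1]."""
    return math.sqrt(2 * x - x * x)
```

For instance, `psi(0.6)` agrees with the closed-form value `1 - math.sqrt(1 - 0.36)`, and `psi_inverse(x)` never exceeds `math.sqrt(2 * x)` on `[0, 1]`, which is the inequality used in the rate discussion.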


Appendix B. Proof of Lemma 4 


For convenience, we state the lemma once again. 
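Lemma 4 For a continuous function $\varphi$ define the Lipschitz constant

$$L_{\varphi,\zeta} = \inf\{L \mid L > 0,\ |\varphi(x) - \varphi(y)| \le L|x - y|,\ -\zeta \le x, y \le \zeta\}$$

and the maximum absolute value of $\varphi(\cdot)$ when the argument is in $[-\zeta, \zeta]$

$$M_{\varphi,\zeta} = \max_{x \in [-\zeta,\zeta]} |\varphi(x)|.$$

Then for $V = d_{VC}(\mathcal{H})$, $c = 24\int_0^1 \sqrt{\ln\frac{8e}{\epsilon^2}}\,d\epsilon$ and any $n$, $\zeta > 0$ and $t > 0$,

$$E \sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}(f)| \le c\,\zeta L_{\varphi,\zeta} \sqrt{\frac{(V+1)(t+1)\log_2[2(t+1)/\ln 2]}{n}}.$$

Also, for any $\delta > 0$, with probability at least $1 - \delta$,

$$\sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}(f)| \le c\,\zeta L_{\varphi,\zeta} \sqrt{\frac{(V+1)(t+1)\log_2[2(t+1)/\ln 2]}{n}} + M_{\varphi,\zeta}\sqrt{\frac{\ln(1/\delta)}{2n}}.$$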
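To get a feeling for the size of the bound in Lemma 4, the following Python sketch approximates the constant $c$ by a midpoint rule and evaluates the right-hand side of the high-probability statement. The function names and the example values are assumptions made for illustration only; for the exponential loss one may take both the Lipschitz constant and the maximum absolute value on $[-\zeta, \zeta]$ equal to $e^{\zeta}$.

```python
import math


def dudley_constant(num_points=200_000):
    """Midpoint-rule approximation of c = 24 * int_0^1 sqrt(ln(8e / eps^2)) d eps.

    The integrand has an integrable singularity at 0, which the midpoint rule
    never evaluates directly.
    """
    h = 1.0 / num_points
    total = 0.0
    for k in range(num_points):
        eps = (k + 0.5) * h
        total += math.sqrt(math.log(8 * math.e / (eps * eps)))
    return 24 * total * h


def lemma4_bound(n, t, V, zeta, lipschitz, max_abs, delta, c=None):
    """Right-hand side of the high-probability bound of Lemma 4.

    lipschitz and max_abs stand for the Lipschitz constant and the maximum
    absolute value of the loss on [-zeta, zeta]; both must be supplied by the caller.
    """
    if c is None:
        c = dudley_constant()
    complexity = (V + 1) * (t + 1) * math.log2(2 * (t + 1) / math.log(2))
    expectation_term = c * zeta * lipschitz * math.sqrt(complexity / n)
    deviation_term = max_abs * math.sqrt(math.log(1 / delta) / (2 * n))
    return expectation_term + deviation_term
```

Computing `dudley_constant()` once and passing it as `c` avoids repeating the numerical integration when the bound is evaluated for many values of `n` and `t`.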
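Proof The proof of this lemma is similar to the proof of Lugosi and Vayatis (2004, Lemma 2) in
that we begin with symmetrization followed by the application of the "contraction principle". We
use symmetrization to get

$$E \sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}(f)| \le 2E \sup_{f \in \pi_\zeta \circ \mathcal{F}^t} \left| \frac{1}{n}\sum_{i=1}^n \sigma_i \big(\varphi(-Y_i f(X_i)) - \varphi(0)\big) \right|,$$

where $\sigma_i$ are i.i.d. with $P(\sigma_i = 1) = P(\sigma_i = -1) = 1/2$. Then we use the "contraction principle"
(see Ledoux and Talagrand, 1991, Theorem 4.12, pp. 112-113) with a function $\psi(x) = (\varphi(x) - \varphi(0))/L_{\varphi,\zeta}$ to get

$$E \sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}(f)| \le 4L_{\varphi,\zeta}\, E \sup_{f \in \pi_\zeta \circ \mathcal{F}^t} \left| \frac{1}{n}\sum_{i=1}^n -\sigma_i Y_i f(X_i) \right| = 4L_{\varphi,\zeta}\, E \sup_{f \in \pi_\zeta \circ \mathcal{F}^t} \left| \frac{1}{n}\sum_{i=1}^n \sigma_i f(X_i) \right|.$$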








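Next we proceed and find the supremum. Notice that functions in $\pi_\zeta \circ \mathcal{F}^t$ are bounded and clipped
to absolute value equal to $\zeta$, therefore we can rescale $\pi_\zeta \circ \mathcal{F}^t$ by $(2\zeta)^{-1}$ and get

$$E \sup_{f \in \pi_\zeta \circ \mathcal{F}^t} \left| \frac{1}{n}\sum_{i=1}^n \sigma_i f(X_i) \right| = 2\zeta\, E \sup_{f \in (2\zeta)^{-1} \circ \pi_\zeta \circ \mathcal{F}^t} \left| \frac{1}{n}\sum_{i=1}^n \sigma_i f(X_i) \right|.$$

Next, we use Dudley's entropy integral (Dudley, 1999) to bound the right hand side above

$$E \sup_{f \in (2\zeta)^{-1} \circ \pi_\zeta \circ \mathcal{F}^t} \left| \frac{1}{n}\sum_{i=1}^n \sigma_i f(X_i) \right| \le \frac{12}{\sqrt{n}} \int_0^\infty \sqrt{\ln \mathcal{N}\big(\epsilon, (2\zeta)^{-1} \circ \pi_\zeta \circ \mathcal{F}^t, L_2(P_n)\big)}\, d\epsilon.$$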


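Since, for $\epsilon > 1$, the covering number $\mathcal{N}$ is 1, the upper integration limit can be taken as 1, and we
can use Pollard's bound (Pollard, 1990) for $F \subseteq [0, 1]^{\mathcal{X}}$,

$$\mathcal{N}(\epsilon, F, L_2(P)) \le 2\left(\frac{4e}{\epsilon^2}\right)^{d_P(F)},$$

where $d_P(F)$ is a pseudo-dimension, and obtain for $\tilde{c} = 12\int_0^1 \sqrt{\ln\frac{8e}{\epsilon^2}}\,d\epsilon$,

$$E \sup_{f \in (2\zeta)^{-1} \circ \pi_\zeta \circ \mathcal{F}^t} \left| \frac{1}{n}\sum_{i=1}^n \sigma_i f(X_i) \right| \le \tilde{c}\sqrt{\frac{d_P\big((2\zeta)^{-1} \circ \pi_\zeta \circ \mathcal{F}^t\big)}{n}}.$$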


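Also notice that the constant $\tilde{c}$ does not depend on $\mathcal{F}^t$ or $\zeta$. Next, since $(2\zeta)^{-1} \circ \pi_\zeta$ is non-decreasing,
we use the inequality $d_P\big((2\zeta)^{-1} \circ \pi_\zeta \circ \mathcal{F}^t\big) \le d_P(\mathcal{F}^t)$ (for example, Anthony and Bartlett, 1999,
Theorem 11.3) to obtain

$$E \sup_{f \in (2\zeta)^{-1} \circ \pi_\zeta \circ \mathcal{F}^t} \left| \frac{1}{n}\sum_{i=1}^n \sigma_i f(X_i) \right| \le \tilde{c}\sqrt{\frac{d_P(\mathcal{F}^t)}{n}}.$$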

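And then, since Lemma 3 gives an upper bound on the pseudo-dimension of the class $\mathcal{F}^t$, we have

$$E \sup_{f \in \pi_\zeta \circ \mathcal{F}^t} \left| \frac{1}{n}\sum_{i=1}^n \sigma_i f(X_i) \right| \le c\,\zeta \sqrt{\frac{(V+1)(t+1)\log_2[2(t+1)/\ln 2]}{n}},$$

with the constant $c$ above being independent of $\mathcal{H}$, $t$ and $\zeta$. To prove the second statement we use
McDiarmid's bounded difference inequality (Devroye et al., 1996, Theorem 9.2, p. 136), since for
all $i \in \{1, \ldots, n\}$

$$\sup_{(x_j,y_j)_{j=1}^n,\,(x_i',y_i')} \left| \sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R_{\varphi,n}(f)| - \sup_{f \in \pi_\zeta \circ \mathcal{F}^t} |R_\varphi(f) - R'_{\varphi,n}(f)| \right| \le \frac{M_{\varphi,\zeta}}{n},$$

where $R'_{\varphi,n}(f)$ is obtained from $R_{\varphi,n}(f)$ by replacing the pair $(x_i, y_i)$ with an independent pair $(x_i', y_i')$.
This completes the proof of the lemma. ∎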


Appendix C. Proof of Theorem 6 


For convenience, we state the theorem once again. 
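Theorem 6 Let the function $Q(f)$ be convex in $f$ and twice differentiable in all directions $h \in \mathcal{H}$.
Let $Q^* = \lim_{\lambda \to \infty} \inf_{f \in \mathcal{F}_\lambda} Q(f)$. Assume that $\forall c_1, c_2$ such that $Q^* < c_1 < c_2 < \infty$,

$$0 < \inf\{Q''(f;h) : c_1 < Q(f) < c_2,\ h \in \mathcal{H}\} \le \sup\{Q''(f;h) : Q(f) < c_2,\ h \in \mathcal{H}\} < \infty.$$

Also assume the following approximate minimization scheme for $\gamma \in (0, 1]$. Define $f_{k+1} = f_k + \alpha_{k+1}h_{k+1}$ such that

$$Q(f_{k+1}) \le \gamma \inf_{h \in \mathcal{H},\, \alpha \in \mathbb{R}} Q(f_k + \alpha h) + (1 - \gamma)Q(f_k)$$

and

$$Q(f_{k+1}) = \inf_{\alpha \in \mathbb{R}} Q(f_k + \alpha h_{k+1}).$$

Then for any reference function $\bar{f}$ and the sequence of functions $f_m$ produced by the boosting
algorithm, the following bound holds $\forall m > 0$ such that $Q(f_m) > Q(\bar{f})$:

$$Q(f_m) \le Q(\bar{f}) + \sqrt{\frac{8B^3\big(Q(f_0) - Q(\bar{f})\big)^2}{\gamma^2\beta^3}} \left( \ln\frac{\ell_0^2 + c_3 m}{\ell_0^2} \right)^{-1/2}, \qquad (11)$$

where $\ell_k = \|\bar{f} - f_k\|_*$, $c_3 = 2(Q(f_0) - Q(\bar{f}))/\beta$, $\beta = \inf\{Q''(f;h) : Q(\bar{f}) < Q(f) < Q(f_0),\ h \in \mathcal{H}\}$,
$B = \sup\{Q''(f;h) : Q(f) < Q(f_0),\ h \in \mathcal{H}\}$.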
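The minimization scheme in Theorem 6 can be made concrete for the empirical exponential risk. The Python sketch below assumes a finite base class of $\pm 1$-valued classifiers that is closed under negation (decision stumps on one-dimensional inputs); in that setting, picking the base classifier with the smallest weighted error and taking the step $\alpha = \frac{1}{2}\ln\frac{1 - \mathrm{err}}{\mathrm{err}}$ is an exact line search along the best direction, which corresponds to $\gamma = 1$. The data representation and all function names are assumptions made for illustration.

```python
import math


def empirical_exp_risk(scores, labels):
    """Empirical exponential risk (1/n) * sum_i exp(-y_i * f(x_i))."""
    return sum(math.exp(-y * s) for s, y in zip(scores, labels)) / len(labels)


def stump_predictions(xs):
    """Predictions of all threshold stumps on 1-D inputs, with both signs."""
    cuts = sorted(set(xs))
    thresholds = [cuts[0] - 1.0] + [0.5 * (a + b) for a, b in zip(cuts, cuts[1:])]
    stumps = []
    for thr in thresholds:
        pred = tuple(1 if x > thr else -1 for x in xs)
        stumps.append(pred)
        stumps.append(tuple(-p for p in pred))  # closure under negation
    return stumps


def adaboost_step(scores, labels, stumps):
    """One boosting round: best direction, then exact line search for the step."""
    weights = [math.exp(-y * s) for s, y in zip(scores, labels)]
    total = sum(weights)

    def weighted_error(pred):
        return sum(w for w, p, y in zip(weights, pred, labels) if p != y) / total

    best = min(stumps, key=weighted_error)
    # Clamp the error away from 0 and 1 to keep the logarithm finite.
    err = min(max(weighted_error(best), 1e-12), 1 - 1e-12)
    alpha = 0.5 * math.log((1 - err) / err)
    new_scores = [s + alpha * p for s, p in zip(scores, best)]
    return new_scores, alpha, best
```

Starting from zero scores and calling `adaboost_step` repeatedly produces a sequence of empirical risks that plays the role of $Q(f_0), Q(f_1), \ldots$ in the theorem.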








Proof The statement of the theorem is a version of a result implicit in the proof of Theorem 1
by Bickel et al. (2006). If for some $m$ we have $Q(f_m) \le Q(\bar{f})$, then the theorem is trivially true
for all $m' \ge m$. Therefore, we are going to consider only the case when $Q(f_m) > Q(\bar{f})$. We shall
also assume $Q(f_{m+1}) > Q(\bar{f})$ (the impact of this assumption will be discussed later). Define $\varepsilon_m = Q(f_m) - Q(\bar{f})$. By convexity of $Q(\cdot)$,

$$|Q'(f_m; f_m - \bar{f})| \ge \varepsilon_m. \qquad (20)$$

Let $f_m - \bar{f} = \sum_i \tilde{\alpha}_i \tilde{h}_i$, where $\tilde{\alpha}_i$ and $\tilde{h}_i$ correspond to the best representation (with the $\ell_1$-norm of $\tilde{\alpha}$
equal to the $\ell_*$-norm). Then from (20) and linearity of the derivative we have

$$\varepsilon_m \le \Big| \sum_i \tilde{\alpha}_i Q'(f_m; \tilde{h}_i) \Big| \le \sup_{h \in \mathcal{H}} |Q'(f_m; h)| \sum_i |\tilde{\alpha}_i|,$$

therefore

$$\sup_{h \in \mathcal{H}} |Q'(f_m; h)| \ge \frac{\varepsilon_m}{\|f_m - \bar{f}\|_*} = \frac{\varepsilon_m}{\ell_m}. \qquad (21)$$


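Next,

$$Q(f_m + \alpha_m h_m) = Q(f_m) + \alpha_m Q'(f_m; h_m) + \frac{1}{2}\alpha_m^2 Q''(\tilde{f}_m; h_m),$$

where $\tilde{f}_m = f_m + \tilde{\alpha}_m h_m$, for $\tilde{\alpha}_m \in [0, \alpha_m]$. By assumption $\tilde{f}_m$ is on the path from $f_m$ to $f_{m+1}$, and
we have assumed exact minimization in the given direction, hence $f_{m+1}$ is the lowest point in the
direction $h_m$ starting from $f_m$, so we have the following bounds

$$Q(\bar{f}) < Q(f_{m+1}) \le Q(\tilde{f}_m) \le Q(f_m) \le Q(f_0).$$

Then by the definition of $\beta$, which depends on $Q(\bar{f})$, we have

$$Q(f_{m+1}) \ge Q(f_m) + \inf_{\alpha \in \mathbb{R}} \Big( \alpha Q'(f_m; h_m) + \frac{1}{2}\alpha^2\beta \Big) = Q(f_m) - \frac{|Q'(f_m; h_m)|^2}{2\beta}. \qquad (22)$$

On the other hand,

$$\begin{aligned}
Q(f_m + \alpha_m h_m) &\le \gamma \inf_{h \in \mathcal{H},\, \alpha \in \mathbb{R}} Q(f_m + \alpha h) + (1 - \gamma)Q(f_m) \\
&\le \gamma \inf_{h \in \mathcal{H},\, \alpha \in \mathbb{R}} \Big( Q(f_m) + \alpha Q'(f_m; h) + \frac{1}{2}\alpha^2 B \Big) + (1 - \gamma)Q(f_m) \\
&= Q(f_m) - \gamma\,\frac{\sup_{h \in \mathcal{H}} |Q'(f_m; h)|^2}{2B}. \qquad (23)
\end{aligned}$$

Therefore, combining (22) and (23), we get

$$|Q'(f_m; h_m)| \ge \sup_{h \in \mathcal{H}} |Q'(f_m; h)| \sqrt{\frac{\gamma\beta}{B}}. \qquad (24)$$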


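Another Taylor expansion, this time around $f_{m+1}$ (and we again use the fact that $f_{m+1}$ is the minimum
on the path from $f_m$), gives us

$$Q(f_m) = Q(f_{m+1}) + \frac{1}{2}\alpha_m^2 Q''(\tilde{\tilde{f}}_m; h_m), \qquad (25)$$

where $\tilde{\tilde{f}}_m$ is some (other) function on the path from $f_m$ to $f_{m+1}$. Therefore, if $|\alpha_m| < \sqrt{\gamma}\,|Q'(f_m; h_m)|/B$,
then

$$Q(f_m) - Q(f_{m+1}) < \frac{\gamma |Q'(f_m; h_m)|^2}{2B},$$

but by (23)

$$Q(f_m) - Q(f_{m+1}) \ge \frac{\gamma \sup_{h \in \mathcal{H}} |Q'(f_m; h)|^2}{2B} \ge \frac{\gamma |Q'(f_m; h_m)|^2}{2B},$$

therefore we conclude, by combining (24) and (21), that

$$|\alpha_m| \ge \frac{\sqrt{\gamma}\,|Q'(f_m; h_m)|}{B} \ge \frac{\gamma\sqrt{\beta}\,\sup_{h \in \mathcal{H}} |Q'(f_m; h)|}{B^{3/2}} \ge \frac{\gamma\,\varepsilon_m\sqrt{\beta}}{\ell_m B^{3/2}}. \qquad (26)$$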
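Using (25) we have

$$\sum_{i=0}^m \alpha_i^2 \le \frac{2}{\beta} \sum_{i=0}^m \big(Q(f_i) - Q(f_{i+1})\big) \le \frac{2}{\beta}\big(Q(f_0) - Q(\bar{f})\big). \qquad (27)$$

Recall that

$$\|f_m - \bar{f}\|_* \le \|f_0 - \bar{f}\|_* + \sum_{i=0}^{m-1} |\alpha_i| \le \|f_0 - \bar{f}\|_* + \sqrt{m}\,\Big( \sum_{i=0}^{m-1} \alpha_i^2 \Big)^{1/2},$$

therefore, combining with (27) and (26), since the sequence $\varepsilon_i$ is decreasing,

$$\begin{aligned}
\frac{2}{\beta}\big(Q(f_0) - Q(\bar{f})\big) &\ge \sum_{i=0}^m \alpha_i^2 \ge \frac{\gamma^2\beta}{B^3} \sum_{i=0}^m \frac{\varepsilon_i^2}{\ell_i^2} \\
&\ge \frac{\gamma^2\beta}{B^3}\,\varepsilon_m^2 \sum_{i=0}^m \frac{1}{\Big( \ell_0 + \sqrt{i}\,\big( \frac{2(Q(f_0) - Q(\bar{f}))}{\beta} \big)^{1/2} \Big)^2} \\
&\ge \frac{\gamma^2\beta}{2B^3}\,\varepsilon_m^2 \sum_{i=0}^m \frac{1}{\ell_0^2 + \frac{2(Q(f_0) - Q(\bar{f}))}{\beta}\, i}.
\end{aligned}$$

Since

$$\sum_{i=0}^m \frac{1}{a + bi} \ge \int_0^{m+1} \frac{dx}{a + bx} = \frac{1}{b}\ln\frac{a + b(m+1)}{a},$$

we have

$$\frac{2}{\beta}\big(Q(f_0) - Q(\bar{f})\big) \ge \frac{\gamma^2\beta^2}{4B^3\big(Q(f_0) - Q(\bar{f})\big)}\,\varepsilon_m^2 \ln\frac{\ell_0^2 + \frac{2(Q(f_0) - Q(\bar{f}))}{\beta}(m+1)}{\ell_0^2}.$$

Therefore

$$\varepsilon_m \le \sqrt{\frac{8B^3\big(Q(f_0) - Q(\bar{f})\big)^2}{\gamma^2\beta^3}} \left( \ln\frac{\ell_0^2 + \frac{2(Q(f_0) - Q(\bar{f}))}{\beta}(m+1)}{\ell_0^2} \right)^{-1/2}. \qquad (28)$$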


The proof of the above inequality for index $m$ works as long as $Q(f_{m+1}) > Q(\bar{f})$. If $\bar{f}$ is such that
$Q(f_m) > Q(\bar{f})$ for all $m$, then we do not need to do anything else. However, if there exists $m'$ such
that $Q(f_{m'}) \le Q(\bar{f})$ and $Q(f_{m'-1}) > Q(\bar{f})$, then the above proof is not valid for index $m' - 1$. To
overcome this difficulty, we notice that $Q(f_{m'-1})$ is bounded from above by $Q(f_{m'-2})$, therefore
to get a bound that holds for all $m$ (except for $m = 0$) we may use a bound for $\varepsilon_{m-1}$ to bound
$Q(f_m) - Q(\bar{f}) = \varepsilon_m$: shift (decrease) the index $m$ on the right hand side of (28) by one. This
completes the proof of the theorem. ∎
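Two ingredients of this proof are easy to check numerically: the comparison between the sum $\sum_{i=0}^m 1/(a + bi)$ and the logarithm obtained by integration, and the slow, inverse square root of a logarithm, decay of the resulting bound in the number of steps $m$. The Python sketch below does both; the function names and the sample parameter values are assumptions of this illustration, and the second function returns the bound on the excess $Q(f_m) - Q(\bar{f})$ in the form stated in Theorem 6 (index $m$, after the shift described above).

```python
import math


def sum_of_reciprocals(a, b, m):
    """sum_{i=0}^{m} 1 / (a + b * i) for a, b > 0."""
    return sum(1.0 / (a + b * i) for i in range(m + 1))


def log_lower_bound(a, b, m):
    """(1 / b) * ln((a + b * (m + 1)) / a), the integral of 1/(a + b x) over [0, m + 1]."""
    return math.log((a + b * (m + 1)) / a) / b


def theorem6_excess_bound(gap0, beta, B, gamma, ell0, m):
    """Bound on Q(f_m) - Q(f_bar) from Theorem 6, for m >= 1.

    gap0 is Q(f_0) - Q(f_bar); beta and B are the lower and upper bounds on the
    second derivative; ell0 is the norm of f_bar - f_0; gamma is in (0, 1].
    """
    c3 = 2 * gap0 / beta
    prefactor = math.sqrt(8 * B ** 3 * gap0 ** 2 / (gamma ** 2 * beta ** 3))
    return prefactor / math.sqrt(math.log((ell0 ** 2 + c3 * m) / ell0 ** 2))
```

Because the integrand $1/(a + bx)$ is decreasing, each term of the sum dominates the integral over the following unit interval, which is why `sum_of_reciprocals` is never smaller than `log_lower_bound`.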


Appendix D. Zero Bayes Risk 


Here we consider a modification of Theorem 8. In this case our assumptions imply that R* = 0, 
and the proof presented above does not work. However for AdaBoost we can modify the proof 
appropriately to show an adequate convergence rate. 


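Theorem 13 Assume $R^* = 0$. Let $t_n$ be the number of steps we run AdaBoost. Let $\lambda_n = \kappa \ln\ln n$
for $\kappa \in (0, 1/6)$. Let $\varepsilon_n = n^{-\nu}$, for $\nu \in (0, 1/2)$. Then with probability at least $1 - \delta_n$, where
$\delta_n = \exp\big(-2n^{1-2\nu}/(\ln n)^{2\kappa}\big)$, for some constant $C$ that depends on $\mathcal{H}$ and $P$ but does not depend
on $n$, for $n$ such that

$$\frac{C}{(\ln n)^{\kappa}} > \frac{2}{n^{\nu}}$$

the following holds

$$R_n(f_{t_n}) \le R_n(\bar{f}_n) + \frac{8\,|R_n(f_0)|^{5/2} (\ln n)^{3\kappa/2}}{C^{3/2}} \left( \ln\frac{(\kappa \ln\ln n)^2 + 4R_n(f_0)(\ln n)^{\kappa}\, t_n / C}{(\kappa \ln\ln n)^2} \right)^{-1/2}.$$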


Proof For the exponential loss the assumption $R^* = 0$ is equivalent to $L^* = 0$. It also implies that the
fastest decrease rate of the function $\tau : \lambda \mapsto \inf_{f \in \mathcal{F}_\lambda} R(f)$ is $O(e^{-\lambda})$. To see this, assume that for
some $\lambda$ there exists $f \in \mathcal{F}_\lambda$ such that $L(g(f)) = 0$ (i.e., we have achieved perfect classification).
Clearly, for any $a > 0$

$$R(af) = E e^{-Y a f(X)} = E\big(e^{-Y f(X)}\big)^a \ge \Big( \inf_{x,y} e^{-y f(x)} \Big)^a.$$

Therefore, choose $\lambda_n = \kappa \ln\ln n$. Then $\inf_{f \in \mathcal{F}_{\lambda_n}} R(f) \ge C(\ln n)^{-\kappa}$, where $C$ depends on $\mathcal{H}$ and $P$,
but does not depend on $n$. On the other hand, Hoeffding's inequality for $\bar{f}_n \in \mathcal{F}_{\lambda_n}$ guarantees that

$$P\big(R(\bar{f}_n) - R_n(\bar{f}_n) \ge \varepsilon_n\big) \le \exp\big(-2n\varepsilon_n^2/(\ln n)^{2\kappa}\big) = \delta_n.$$

The choice of $\varepsilon_n = n^{-\nu}$ for $\nu \in (0, 1/2)$ ensures that $\delta_n \to 0$. This allows us to conclude that with
probability at least $1 - \delta_n$ the empirical risk $R_n(\bar{f}_n)$ can be lower bounded as

$$R_n(\bar{f}_n) \ge R(\bar{f}_n) - \varepsilon_n$$

and for $n$ large enough for

$$\frac{C}{(\ln n)^{\kappa}} > \frac{2}{n^{\nu}}$$

to hold we get a lower bound on $\beta$ in (11) of Theorem 6 as

$$\beta \ge \frac{C}{2(\ln n)^{\kappa}}.$$

Since for $\bar{f}_n$ such that $R_n(\bar{f}_n) > R_n(f_0)$ the theorem trivially holds, we only have to plug $R_n(\bar{f}_n) = 0$,
$B = R_n(f_0)$ and $\beta = C(\ln n)^{-\kappa}/2$ into (11) to get the statement of the theorem. Obviously, this bound
holds for $R^* > 0$. ∎
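The quantities appearing in this argument are simple to tabulate. The Python sketch below evaluates the failure probability $\delta_n = \exp(-2n^{1-2\nu}/(\ln n)^{2\kappa})$ and, when the sample-size condition $C/(\ln n)^{\kappa} > 2/n^{\nu}$ holds, the resulting lower bound $C/(2(\ln n)^{\kappa})$ on $\beta$. The constant $C$ is not known explicitly, so any value passed for it is a hypothetical placeholder, and the function names are assumptions of this illustration.

```python
import math


def delta_n(n, nu, kappa):
    """Failure probability exp(-2 n^(1 - 2 nu) / (ln n)^(2 kappa)), for n >= 3."""
    return math.exp(-2 * n ** (1 - 2 * nu) / math.log(n) ** (2 * kappa))


def beta_lower_bound(n, nu, kappa, C):
    """C / (2 (ln n)^kappa) if C / (ln n)^kappa > 2 / n^nu, otherwise None.

    C stands for the unknown constant of the theorem; its value here is hypothetical.
    """
    lhs = C / math.log(n) ** kappa
    if lhs > 2 / n ** nu:
        return lhs / 2
    return None
```

For example, with $\nu = 1/4$ and $\kappa = 1/10$ the failure probability is already far below one percent at $n = 100$ and decreases rapidly as $n$ grows, whereas the lower bound on $\beta$ shrinks only like a small power of $\ln n$.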


Appendix E. 
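Lemma 14 Let the function $\varphi : \mathbb{R} \to \mathbb{R}^+ \cup \{0\}$ be convex. Then for any $\lambda > 0$

$$\varphi(\pi_\lambda(x)) \le \varphi(x) + \inf_{z \in [-\lambda, \lambda]} \varphi(z). \qquad (29)$$

Proof If $x \in [-\lambda, \lambda]$ then the statement of the lemma is clearly true. Without loss of generality
assume $x > \lambda$; the case $x < -\lambda$ is similar. Then we have two possibilities.

1. $\varphi(x) \ge \varphi(\lambda) = \varphi(\pi_\lambda(x))$ and (29) is obvious.

2. $\varphi(x) < \varphi(\lambda)$. Due to convexity, for any $z < \lambda$ we have $\varphi(z) > \varphi(\lambda)$, therefore

$$\varphi(\pi_\lambda(x)) = \varphi(\lambda) \le \varphi(\lambda) + \varphi(x) = \inf_{z \in [-\lambda, \lambda]} \varphi(z) + \varphi(x).$$

The statement of the lemma is proven. ∎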
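The inequality of Lemma 14 can be checked numerically for common convex, non-negative losses. In the Python sketch below the infimum over $[-\lambda, \lambda]$ is approximated on a grid, which can only overestimate it, so the check is slightly weaker than the lemma itself; the function names, the grid size and the tolerance are assumptions of this illustration.

```python
import math


def clip(x, lam):
    """The clipping function pi_lambda(x): restrict x to the interval [-lam, lam]."""
    return max(-lam, min(lam, x))


def satisfies_lemma14(phi, lam, xs, grid=2001):
    """Check phi(pi_lam(x)) <= phi(x) + inf_{z in [-lam, lam]} phi(z) for all x in xs.

    The infimum is replaced by a minimum over an evenly spaced grid on [-lam, lam].
    """
    zs = [-lam + 2 * lam * k / (grid - 1) for k in range(grid)]
    inf_phi = min(phi(z) for z in zs)
    return all(phi(clip(x, lam)) <= phi(x) + inf_phi + 1e-12 for x in xs)
```

Applied to the exponential, hinge, squared and logistic losses the check passes for every clipping level tried, whereas a non-convex function that is large at the clipping point but zero far away fails it, which illustrates why convexity is needed.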